A typical data science project:
Visualisation (Ch. 3) is a great place to start with R programming: the payoff is immediate.
Data transformation (Ch. 5) deals with key verbs that allow you to select important variables, filter out key observations, create new variables, and compute summaries.
In exploratory data analysis (Ch. 7), we’ll combine visualisation and transformation in order to ask and answer interesting questions about data.
In this chapter we will learn how to visualise data using ggplot2, a part of the tidyverse.
Install tidyverse, if you have not:
install.packages("tidyverse")
After installation, load tidyverse with:
library("tidyverse")
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.2 ✔ purrr 1.0.1
## ✔ tibble 3.2.1 ✔ dplyr 1.1.2
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
mpg data frame
The mpg data frame can be found in the ggplot2 package (aka ggplot2::mpg):
mpg
displ: engine size, in litres.
hwy: highway fuel efficiency, in miles per gallon (mpg).
Scatterplot of hwy vs displ:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy))
ggplot() creates a coordinate system that you can
add layers to.
First argument of ggplot() is the dataset to use in
the graph.
Function geom_point() adds a layer
of points to your plot.
mapping argument: defines how variables in your
dataset are mapped to visual properties.
The mapping argument is always paired with aes().
The x and y arguments of aes() specify which variables to map to the x and y axes.
ggplot2 looks for the mapped variables in the data
argument, in this case, mpg.
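To see the data and mapping arguments at work on other columns, here is a minimal sketch. The choice of cyl (number of cylinders) and cty (city miles per gallon) is ours for illustration and does not come from the notes.

```r
library(tidyverse)

# data = mpg tells ggplot2 where to look up the variables;
# aes() maps cyl to the x axis and cty to the y axis.
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = cyl, y = cty))
p
```

Every row of mpg becomes one point, although many points overlap because cyl takes only a few distinct values.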
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
How can you explain the red dots?
You can map the colors of your points to the class
variable to reveal the class of each car:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
What can you say about the red dots in the previous plot?
Assign different sizes to points according to
class:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, size = class))
Assign different transparency levels to points according to
class:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
## Warning: Using alpha for a discrete variable is not advised.
Assign different shapes to points according to
class:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, shape = class))
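R has 25 built-in shapes that are identified by numbers. Some appear to
be duplicates: for example, 0, 15, and 22 are all squares. The difference
comes from the interaction of the colour and fill aesthetics: hollow shapes
(0 to 14) have a border determined by colour, solid shapes (15 to 20) are
filled with colour, and filled shapes (21 to 24) have a border of colour
and are filled with fill.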
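The interaction between colour and fill is easier to see when all 25 shapes are drawn with the same settings. The following is a sketch only: the grid layout and the black/red colours are arbitrary choices made for this illustration.

```r
library(tidyverse)

# One row per shape number, laid out on a 5 x 5 grid.
shapes <- tibble(
  shape = 0:24,
  x = shape %% 5,
  y = shape %/% 5
)

shape_plot <- ggplot(shapes, aes(x = x, y = y)) +
  # colour and fill are set (not mapped), so every point gets the same values
  geom_point(aes(shape = shape), size = 5, colour = "black", fill = "red") +
  geom_text(aes(label = shape), nudge_y = 0.35) +
  scale_shape_identity()  # use the numbers in `shape` directly as shape codes
shape_plot
```

Only shapes 21 to 24 should show a red interior, because they are the only ones that use the fill aesthetic.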
Set the color of all points to be blue:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
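A common slip is to put the colour inside aes(). The sketch below contrasts the two forms; the object names set_blue and mapped_blue are ours, and layer_data() is used only to inspect the colour that ggplot2 ends up drawing.

```r
library(tidyverse)

# Setting: color is an argument of geom_point(), so every point is blue.
set_blue <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Mapping: inside aes(), "blue" is treated as a variable with one value,
# so ggplot2 picks a colour from its default scale and adds a legend.
mapped_blue <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

unique(layer_data(set_blue)$colour)     # the colour actually drawn
unique(layer_data(mapped_blue)$colour)  # a default scale colour, not "blue"
```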
Another way to add additional variables
Facets divide a plot into subplots based on the values of one or more discrete variables.
A subplot for each car type, facet_wrap():
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
A subplot for each car type and drive,
facet_grid():
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ class)
geom_smooth(): smooth line
How are these two plots similar?
# left
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) # point geom
# right
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy)) # smooth geom
Both plots show the same x variable, the same y variable, and the same data, but they use different geoms.
Recall that every geom function in
ggplot2 takes a mapping argument.
Different line types according to drv:
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
Different line colors according to drv:
ggplot(data = mpg) +
geom_smooth(mapping = aes(x = displ, y = hwy, color = drv))
Plot containing two geoms in the same graph!
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
geom_smooth(mapping = aes(x = displ, y = hwy))
Same as
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() + geom_smooth()
Different aesthetics in different layers:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)
Compare this with
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth()
(You’ll learn how filter() works in the next chapter:
for now, just know that this command selects only the subcompact
cars.)
diamonds data
diamonds
nrow(diamonds)
## [1] 53940
Total number of diamonds in the diamonds dataset,
grouped by cut:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut)) # a new geom
The y-axis shows count, but count is not a variable in diamonds.
Bar charts, like histograms, frequency polygons, smoothers, and boxplots, plot some computed variables instead of raw data.
New values are computed via statistical transformations (stats).
Check available computed variables for a geometric object via help:
?geom_bar
?geom_bar shows that the default value for
stat is “count”, which means that geom_bar()
uses stat_count().
Use stat_count() directly:
ggplot(data = diamonds) +
stat_count(mapping = aes(x = cut))
stat_count() has a default geom, geom_bar(),
so this produces the same plot.
Display relative frequencies instead of counts:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
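The counts and proportions that the stat computes can also be worked out by hand, which makes it clearer what the bars represent. This is a sketch using dplyr; the name cut_summary is ours.

```r
library(tidyverse)

cut_summary <- diamonds %>%
  count(cut) %>%               # number of diamonds per cut
  mutate(prop = n / sum(n))    # share of all diamonds
cut_summary

# Plot the precomputed counts; geom_col() uses the y values as they are.
ggplot(data = cut_summary) +
  geom_col(mapping = aes(x = cut, y = n))
```

The resulting chart should match the geom_bar() chart of counts, since both are built from the same five numbers.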
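Custom stat: stat_summary() summarises the y values for each x (here the median, with the minimum and maximum as the range):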
ggplot(data = diamonds) +
stat_summary(
mapping = aes(x = cut, y = depth),
fun.min = min,
fun.max = max,
fun = median
)
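For comparison, roughly the same summary can be computed explicitly and then drawn with geom_pointrange(). This is a sketch under the assumption that a point range is the intended display; depth_summary and its column names are ours.

```r
library(tidyverse)

depth_summary <- diamonds %>%
  group_by(cut) %>%
  summarise(
    ymin = min(depth),
    y    = median(depth),
    ymax = max(depth)
  )
depth_summary

# Draw the precomputed summary: a point at the median, a line from min to max.
ggplot(data = depth_summary) +
  geom_pointrange(mapping = aes(x = cut, y = y, ymin = ymin, ymax = ymax))
```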
Colour the outline of each bar:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, colour = cut))
Fill color:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = cut))
Fill color according to another variable:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity))
The stacking is performed automatically by the position
adjustment specified by the position argument. The
default behaviour is to stack bars on top of each other. The
following code (you do not need to know the details now) shows the
counts of clarity categories of the “Ideal” cut:
diamonds %>% filter(cut == "Ideal") %>% select(clarity) %>% table()
## clarity
## I1 SI2 SI1 VS2 VS1 VVS2 VVS1 IF
## 146 2598 4282 5071 3589 2606 2047 1212
Note that the heights of the stacked segments are proportional to these counts. (“I1” takes up too small a proportion to be seen easily in the chart.)
If you don’t want a stacked bar chart, you can use one of three other
options: "identity", "dodge" or
"fill".
position_identity() places each object exactly where
it falls (that is, each bar starts from 0 and the bars are superposed on each
other):
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "identity")
position="identity" is a shorthand for
position_identity().
position_dodge() arranges elements side by side:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")
This is like
position_identity() spread over the
x-axis.
position_fill() stacks elements on top of one
another, like the default behaviour, but normalizes the height:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
position_stack() recovers the default plot:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "stack")
position_jitter() adds random noise to the x and y
position of each element to avoid overplotting:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")
geom_jitter() is a shorthand for
geom_point(position = "jitter").
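A minimal sketch of the shorthand follows. The width and height values are illustrative assumptions that control how much noise is added; they are not values taken from the notes.

```r
library(tidyverse)

set.seed(1)  # jitter is random; setting a seed first makes this run repeatable
jittered <- ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = displ, y = hwy), width = 0.05, height = 0.5)
jittered
```

Smaller width and height keep points closer to their true positions, at the cost of more overlap.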
A boxplot:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot()
coord_cartesian() is the default cartesian
coordinate system:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_cartesian(xlim = c(0, 5))
coord_fixed() specifies aspect ratio:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_fixed(ratio = 1/2)
coord_flip() flips the x- and y-axes:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
geom_boxplot() +
coord_flip()
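A similar horizontal boxplot can be obtained without coord_flip() by mapping the categorical variable to y. This sketch assumes ggplot2 3.3.0 or later (the session above reports 3.4.2); the name horizontal is ours.

```r
library(tidyverse)

# Map class to y and hwy to x instead of flipping the coordinate system.
horizontal <- ggplot(data = mpg, mapping = aes(x = hwy, y = class)) +
  geom_boxplot()
horizontal
```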
A map:
install.packages("maps") # need to install this package
library("maps")
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
nz <- map_data("nz")
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black")
coord_quickmap() sets the aspect ratio correctly for maps:
ggplot(nz, aes(long, lat, group = group)) +
geom_polygon(fill = "white", colour = "black") +
coord_quickmap()
coord_polar() uses polar coordinates:
bar <- ggplot(data = diamonds) +
geom_bar(
mapping = aes(x = cut, fill = cut),
show.legend = FALSE,
width = 1
) +
theme(aspect.ratio = 1) +
labs(x = NULL, y = NULL)
bar + coord_flip()
bar + coord_polar()
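The pieces covered so far (geom, stat, position, coordinate system, facets) can be combined in a single call. The sketch below does so with the diamonds data; this particular combination is chosen only for illustration.

```r
library(tidyverse)

combined <- ggplot(data = diamonds) +
  geom_bar(
    mapping = aes(x = cut, fill = clarity),
    stat = "count",        # the default stat for geom_bar()
    position = "fill"      # stack and normalise each bar to height 1
  ) +
  coord_flip() +           # coordinate function
  facet_wrap(~ color)      # facet function: one panel per diamond colour
combined
```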
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
A figure title should be descriptive:
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
labs(title = "Fuel efficiency generally decreases with engine size")
The subtitle and caption can give more detail:
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
labs(
title = "Fuel efficiency generally decreases with engine size",
subtitle = "Two seaters (sports cars) are an exception because of their light weight",
caption = "Data from fueleconomy.gov"
)
Replace axis labels with more descriptive text:
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
geom_smooth(se = FALSE) +
labs(
x = "Engine displacement (L)",
y = "Highway fuel economy (mpg)"
)
Use quote() to put mathematical expressions in labels (see ?plotmath for the available syntax):
df <- tibble(x = runif(10), y = runif(10))
ggplot(df, aes(x, y)) + geom_point() +
labs(
x = quote(sum(x[i] ^ 2, i == 1, n)),
y = quote(alpha + beta + frac(delta, theta))
)
?plotmath
Create labels
best_in_class <- mpg %>%
group_by(class) %>%
filter(row_number(desc(hwy)) == 1)
best_in_class
## # A tibble: 7 × 11
## # Groups: class [7]
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 chevrolet corvette 5.7 1999 8 manu… r 16 26 p 2sea…
## 2 dodge caravan 2wd 2.4 1999 4 auto… f 18 24 r mini…
## 3 nissan altima 2.5 2008 4 manu… f 23 32 r mids…
## 4 subaru forester a… 2.5 2008 4 manu… 4 20 27 r suv
## 5 toyota toyota tac… 2.7 2008 4 manu… 4 17 22 r pick…
## 6 volkswagen jetta 1.9 1999 4 manu… f 33 44 d comp…
## 7 volkswagen new beetle 1.9 1999 4 manu… f 35 44 d subc…
Annotate points
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(colour = class)) +
geom_text(aes(label = model), data = best_in_class)
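The best_in_class selection can also be written with slice_max(), which some readers find easier to read than the row_number() filter. This is a sketch assuming dplyr 1.0.0 or later; best_in_class_alt is a name introduced here, and when several cars tie for the top hwy the row picked may differ from the one above.

```r
library(tidyverse)

best_in_class_alt <- mpg %>%
  group_by(class) %>%
  slice_max(hwy, n = 1, with_ties = FALSE) %>%  # keep one top row per class
  ungroup()

best_in_class_alt %>% select(class, model, hwy)
```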
The ggrepel package automatically adjusts labels so that
they don’t overlap:
install.packages("ggrepel")
library("ggrepel")
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
geom_point(size = 3, shape = 1, data = best_in_class) +
ggrepel::geom_label_repel(aes(label = model), data = best_in_class)
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class))
ggplot2 automatically adds default scales, so the plot above is equivalent to:
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
scale_x_continuous() +
scale_y_continuous() +
scale_colour_discrete()
breaks
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_y_continuous(breaks = seq(15, 40, by = 5))
labels
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
scale_x_continuous(labels = NULL) +
scale_y_continuous(labels = NULL)
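Besides NULL, the labels argument also accepts a function that turns break values into text. The sketch below combines custom breaks with such a function; add_mpg is a hypothetical helper written for this example.

```r
library(tidyverse)

# Hypothetical helper: append a unit to each break value.
add_mpg <- function(x) paste(x, "mpg")

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  scale_y_continuous(
    breaks = seq(15, 40, by = 5),  # where the ticks go
    labels = add_mpg               # how each tick is labelled
  )
```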
Plot y-axis at log scale:
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
scale_y_log10()
Plot x-axis in reverse order:
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
scale_x_reverse()
Set legend position: "left", "right",
"top", "bottom", or "none":
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(colour = class)) +
theme(legend.position = "left")
See following link for more details on how to change title, labels, … of a legend.
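As a small sketch of legend tweaks beyond position: labs() can rename the legend through the aesthetic it describes, and guide_legend() controls its layout. The title text "Car class" and the single-row layout are our own choices for illustration.

```r
library(tidyverse)

legend_plot <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class)) +
  labs(colour = "Car class") +                # legend title
  guides(colour = guide_legend(nrow = 1)) +   # lay the keys out in one row
  theme(legend.position = "bottom")
legend_plot
```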
Without clipping (does not remove unseen data points)
ggplot(mpg, mapping = aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth() +
coord_cartesian(xlim = c(5, 7), ylim = c(10, 30))
With clipping (removes unseen data points)
ggplot(mpg, mapping = aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth() +
xlim(5, 7) + ylim(10, 30)
Equivalently, set the limits in the scales:
ggplot(mpg, mapping = aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth() +
scale_x_continuous(limits = c(5, 7)) +
scale_y_continuous(limits = c(10, 30))
Or filter the data before plotting:
mpg %>%
filter(displ >= 5, displ <= 7, hwy >= 10, hwy <= 30) %>%
ggplot(aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth()
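The difference between zooming and clipping can be checked by inspecting the data each plot keeps. This is a sketch that restricts only the x axis, for brevity; zoomed and clipped are names introduced here, and layer_data() is used purely for inspection.

```r
library(tidyverse)

zoomed <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  coord_cartesian(xlim = c(5, 7))

clipped <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  xlim(5, 7)

# Zooming keeps every row in the layer data.
nrow(layer_data(zoomed))
# With scale limits, out-of-range x values become NA and are dropped when drawn.
sum(!is.na(layer_data(clipped)$x))
# For comparison: rows that survive an explicit filter on displ.
nrow(filter(mpg, displ >= 5, displ <= 7))
```

This matters most for geom_smooth(): with clipping the smoother is fitted to the remaining points only, so the curve can change shape.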
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE) +
theme_bw()
The eight themes built-in to ggplot2.
ggplot(mpg, aes(displ, hwy)) + geom_point()
ggsave("my-plot.pdf")
## Saving 5 x 3.5 in image
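By default ggsave() saves the last plot displayed, at the size of the current graphics device. For more reproducible output, the plot and its dimensions can be passed explicitly, as in this sketch. The file location and the 6 x 4 inch size are assumptions made for the example.

```r
library(tidyverse)

p <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Hypothetical output location; tempfile() avoids cluttering the working directory.
out_file <- tempfile(fileext = ".pdf")

# Passing plot, width, and height explicitly avoids depending on the last
# plot displayed and on the current device size.
ggsave(out_file, plot = p, width = 6, height = 4, units = "in")
file.exists(out_file)
```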
ggplot2 provides over 30 geoms, and extension packages provide even more (see https://exts.ggplot2.tidyverse.org for a sampling). The best way to get a comprehensive overview is the ggplot2 cheatsheet, which you can find at https://rstudio.com/resources/cheatsheets/.